Medical Decision Making
SAGE Publications
Preprints posted in the last 90 days, ranked by how well they match Medical Decision Making's content profile, based on 10 papers previously published here. The average preprint has a 0.01% match score for this journal, so anything above that is already an above-average fit.
Gracia, V.; Goldhaber-Fiebert, J. D.; Alarid-Escudero, F.
Purpose: We introduce PRE-CISE, a pre-calibration workflow that integrates coverage analysis, local sensitivity, and collinearity diagnostics to streamline model calibration and transparently address nonidentifiability. We demonstrate the benefits of PRE-CISE using a four-state Sick-Sicker Markov testbed and a COVID-19 case study. Methods: PRE-CISE begins with a coverage analysis to verify that model outputs generated with parameter sets drawn from their prior distributions span the calibration targets, followed by local sensitivity analyses to quantify the influence of parameters on model outputs, guiding the resizing of prior distribution bounds to improve coverage. Identifiability is then assessed via collinearity analysis; large indices indicate practical nonidentifiability. For the testbed model, we calibrated 3 parameters to survival, prevalence, and the proportion of Sick to Sicker at 10, 20, and 30 years. For the COVID-19 model, we calibrated 11 parameters to match daily confirmed incident cases. Bayesian calibration was conducted for both analyses. Results: Coverage analyses flagged initial misfits; local sensitivities identified that the Sick-to-Sicker transition probability has the greatest effect on model outputs, and resizing its prior distribution bounds improved coverage. Collinearity analyses showed that combining multiple calibration targets across time points enabled recovery of all three parameters. In the COVID-19 model, local sensitivity analyses prioritized time-varying detection rates and contact-reduction effects, reducing the search space and thereby improving calibration efficiency. Daily incident case calibration targets yielded collinearity indices below practical thresholds (e.g., < 15) for all parameter combinations, whereas indices for weekly calibration targets were larger and closer to the cutoff.
Conclusions: PRE-CISE provides a practical, transparent pathway that helps modelers refine prior distribution bounds and calibration targets before intensive calibration, improving uncertainty reporting and strengthening the reliability of model-based health policy analyses.
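The collinearity diagnostic described above is typically the index of Brun and colleagues: normalize the columns of the local sensitivity matrix and take one over the square root of the smallest eigenvalue of the resulting Gram matrix. A minimal numpy sketch, with made-up toy sensitivity matrices (the abstract's practical cutoff of 15 is used as the flag):

```python
import numpy as np

def collinearity_index(S):
    """Collinearity index (Brun et al.): gamma = 1/sqrt(lambda_min), where
    lambda_min is the smallest eigenvalue of S_norm^T S_norm and S_norm has
    unit-length columns (one column per parameter, one row per output).
    Large gamma (e.g. > 15) signals practical nonidentifiability."""
    S = np.asarray(S, dtype=float)
    S_norm = S / np.linalg.norm(S, axis=0)           # unit-length columns
    eigvals = np.linalg.eigvalsh(S_norm.T @ S_norm)  # symmetric Gram matrix
    return 1.0 / np.sqrt(eigvals.min())

# Two nearly proportional sensitivity columns -> near-collinear, large index
S_bad = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 6.0]])
# Orthogonal columns -> index of exactly 1 (fully identifiable pair)
S_good = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
print(collinearity_index(S_bad), collinearity_index(S_good))
```

Here the nearly proportional pair yields an index far above 15, while the orthogonal pair yields 1, mirroring how the workflow flags parameter combinations that calibration targets cannot jointly recover.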
ORWA, F. O.; Mutai, C.; Nizeyimana, I.; Mwangi, A.
When randomized controlled trials are impractical, interrupted time series (ITS) designs offer a rigorous quasi-experimental approach for assessing population-level policies, and ITS is commonly regarded as the most robust of the quasi-experimental designs (QEDs). However, ITS designs are susceptible to serial correlation and to confounding by time-varying factors associated with both the intervention and the outcome, which may result in biased inference. We therefore provide a simulation-based comparison of controlled interrupted time series (CITS) and multivariable regression (multivariable negative binomial regression) for estimating policy effects in count time series data. These approaches are widely used in policy evaluations, yet their comparative performance in typical population health settings has rarely been examined directly. We tested both approaches across a variety of data-generating scenarios differing in series length, intervention effect size, and magnitude of lag-1 autocorrelation. Performance was assessed using bias, standard-error calibration, confidence interval coverage, mean squared error, and statistical power. Both methods gave unbiased estimates for moderate and large intervention effects, although bias was more pronounced for small effects, particularly in short series. Although point-estimate performance was similar, inferential properties varied substantially. CITS consistently had smaller mean squared error, better agreement between model-based and empirical standard errors, and confidence interval coverage near the nominal 95% level under weak to moderate autocorrelation. By contrast, multivariable regression was more sensitive to serial dependence, leading to underestimated standard errors and undercoverage, especially at moderate to high autocorrelation, even with Newey-West adjustments.
These findings demonstrate the benefits of using a concurrent control series and the importance of structurally accounting for serial correlation when studying population-level policies with time series data.
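A data-generating process of the kind the simulations describe, a count outcome with a step intervention effect and lag-1 autocorrelated noise on the log scale, can be sketched as follows (parameter values are illustrative, not the study's scenarios; the naive estimator here ignores trends, controls, and serial correlation, which is exactly the inferential weakness the study probes):

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_its(n_pre=24, n_post=24, beta_level=0.5, rho=0.6, mu0=50):
    """Simulate a count series with an intervention level change (beta_level,
    on the log scale) and AR(1) errors with lag-1 autocorrelation rho."""
    n = n_pre + n_post
    post = (np.arange(n) >= n_pre).astype(float)
    e = np.zeros(n)
    for t in range(1, n):                 # lag-1 autocorrelated log-rate noise
        e[t] = rho * e[t - 1] + rng.normal(0, 0.1)
    lam = mu0 * np.exp(beta_level * post + e)
    return rng.poisson(lam), post

def naive_level_change(y, post):
    """Naive log-scale level-change estimate, ignoring serial correlation."""
    return np.log(y[post == 1].mean()) - np.log(y[post == 0].mean())

# Averaged over replications, the point estimate is near the true 0.5 even
# though single-series standard errors would be miscalibrated under rho > 0.
ests = [naive_level_change(*simulate_its()) for _ in range(200)]
print(float(np.mean(ests)))
```

This matches the abstract's headline pattern: point estimates can be roughly unbiased while valid inference still requires modeling the serial dependence (or a concurrent control series).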
Hassoon, A.; Peng, X.; Irimia, R.; Lianjie, A.; Leo, H.; Bandeira, A.; Woo, H. Y.; Dredze, M.; Abdulnour, R.-E.; McDonald, K. M.; Peterson, S.; Newman-Toker, D.
Background: Diagnostic errors are a leading cause of preventable patient harm, often occurring during early clinical encounters where diagnostic uncertainty is maximal. Large language models (LLMs) have shown potential in medical reasoning, yet their ability to function as a diagnostic safety net, specifically by identifying and correcting human diagnostic errors, remains systematically unquantified. We evaluated whether state-of-the-art LLMs can effectively challenge, rather than merely confirm, an erroneous physician diagnosis. Methods: We evaluated 16 leading LLMs (including GPT-o1, Gemini 2.5 Pro, and Claude 3.7 Sonnet) using 200 standardized clinical vignettes representing 20 high-stakes, frequently misdiagnosed conditions. Models were presented with the full clinical record and an incorrect physician diagnosis. Primary outcomes included the diagnostic correction rate (disagreeing with the error and providing the correct diagnosis) and the ratio of correction to error detection. We further tested model robustness by generating 2,200 variants to assess the influence of demographic (race/ethnicity) and contextual (institutional reputation, training level, insurance) tokens. Results: Diagnostic correction rates varied significantly across models. Gemini 2.5 Pro demonstrated the highest performance, correcting the physician's error in 55.0% of cases (n=110/200), followed by Claude Sonnet 3.5 (48.5%) and Sonnet 4 (47.0%). In contrast, DeepSeek V3 corrected only 20.0% of cases. Failures were strikingly consistent at the disease level; most models failed to correct errors in syphilis, spinal epidural abscess, and myocardial infarction. Furthermore, several models exhibited confirmation bias, agreeing with the incorrect diagnosis in 11.0% to 50.0% of cases. Stability across demographic and contextual variants was inconsistent, with some models showing spurious performance shifts based on non-clinical tokens.
Conclusion: While top-performing LLMs can intercept approximately half of human diagnostic errors in high-stakes scenarios, performance is heterogeneous and highly sensitive to non-clinical context. Current models exhibit significant disease-specific gaps and a tendency toward confirmation bias, suggesting that their safe clinical integration requires adversarial, multi-agent workflows designed to prioritize skepticism over baseline agreement.
Jafari, H.; Chu, P.; Lange, M.; Maher, F.; Glen, C.; Pearson, O. J.; Burges, C.; Martyn, M.; Cross, S.; Carter, B.; Emsley, R.; Forbes, G.
Background: Statistical Analysis Plans (SAPs) are essential for trial transparency and credibility but are resource-intensive to produce. While Large Language Models (LLMs) have shown promise in drafting protocols, their ability to generate high-quality, protocol-compliant SAPs remains untested against current content guidance. This study developed and validated an LLM-based pipeline for drafting SAPs from clinical trial protocols. Methods: We developed a structured, section-by-section prompting pipeline aligned with standard SAP guidance. We applied this pipeline to nine clinical trial protocols using three leading LLMs: OpenAI GPT-5, Anthropic Claude Sonnet 4, and Google Gemini 2.5 Pro. The resulting 27 SAPs were evaluated against a 46-item quality checklist derived from published SAP guidelines. Items were double-scored by independent trial statisticians on a 0 to 3 scale for accuracy. We compared performance across LLMs and between item types (descriptive vs. statistical reasoning) using mixed-effects logistic regression. Results: Across the 9 trials, the models produced SAP drafts with high overall accuracy (77% to 78%); performance did not differ among the three LLMs (p=0.79) but varied by content type (p < 0.001). All models performed well on descriptive items (e.g., administrative details, trial design), with lower accuracy for items requiring statistical reasoning (e.g., modelling strategies, sensitivity analyses). Accuracy for statistical items ranged from 67% to 72%, whereas descriptive items achieved 81% to 83% accuracy. Qualitatively, models were prone to specific failure modes in complex sections, such as omitting necessary details for secondary outcome models or hallucinating sensitivity analyses. Discussion: Current LLMs can effectively draft portions of SAPs, offering the potential for substantial time savings in trial documentation.
However, a human-in-the-loop approach remains mandatory; while models demonstrate strong capability in producing descriptive content, their independent application to complex statistical methodology design still requires further methodological development and training. Future work should explore advanced prompt engineering, such as retrieval-augmented generation or agentic workflows, to improve reasoning capabilities.
Nichols, B. E.; Wonderly Trainor, B.; Hampson, G.; Grad, Y. H.; Klausner, J. D.
Background: Rising antimicrobial resistance in Neisseria gonorrhoeae threatens the effectiveness of existing therapies. Resistance-guided treatment (RGT) may reduce treatment failures, complications, and inappropriate use of last-line agents while slowing resistance emergence. Methods and Findings: We developed an individual-level stochastic simulation model of gonorrhea diagnosis and treatment in the United States, incorporating infection prevalence, symptom status, diagnostic accuracy, resistance profiles, treatment pathways, and partner management (costs in 2025 USD). We evaluated three resistance testing strategies (ciprofloxacin-only, ciprofloxacin + ceftriaxone, and triple-target, which includes a novel drug A) across a wide range of resistance scenarios. We quantified economic value across three dimensions: (1) per-episode direct medical cost savings, (2) system-level costs attributable to ceftriaxone resistance emergence among MSM, and (3) avoided costs of new antibiotic development, estimating the maximum per-test price at which RGT remains cost-neutral. Per-episode cost-neutrality thresholds ranged from near $0 when ceftriaxone resistance was absent to up to $45/test at 15% ceftriaxone resistance. At 50% ciprofloxacin and 5% ceftriaxone resistance, the population-weighted threshold was $4 (95% UI: $3-$8) for a CIP-only test and $11 (95% UI: $5-$14) for a triple-target test. Among MSM, incorporating system-level resistance emergence costs and avoided antibiotic development costs increased the total per-test value to $35-$145 for a single-target test and $84-$128 for a triple-target test, depending on whether prescribing practices shift when ceftriaxone resistance reaches 5%.
Conclusions: Resistance-guided therapy offers economic benefits across multiple dimensions even at relatively high diagnostic prices, supporting investment in gonorrhea resistance testing to improve partner outcomes, delay resistance emergence, and enhance the long-term cost-efficiency of gonorrhea management.
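The population-weighted threshold quoted above can be illustrated as a weighted average of subgroup-specific cost-neutrality prices; the weights and per-group values below are hypothetical placeholders, not the study's inputs:

```python
# Hypothetical subpopulation weights and per-episode cost-neutrality
# thresholds (USD per test). A test is population cost-neutral when its
# price equals the weighted average of averted costs per episode.
subgroups = {
    "MSM":   {"weight": 0.30, "threshold": 11.0},
    "MSW":   {"weight": 0.45, "threshold": 2.0},
    "Women": {"weight": 0.25, "threshold": 1.0},
}

weighted = sum(g["weight"] * g["threshold"] for g in subgroups.values())
print(f"population-weighted cost-neutral price: ${weighted:.2f}/test")
```

The same averaging explains why group-specific thresholds (e.g., for MSM, where resistance-emergence costs concentrate) can far exceed the population-wide figure.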
Janoudi, G.; Rada (Uzun), M.; Yasinov, E.; Richter, T.
Background: Health technology assessment (HTA) agencies issue reimbursement recommendations that determine patient access to new therapies. Predicting these outcomes would enable sponsors to optimize market access strategies and health systems to anticipate budget impacts. However, traditional machine learning approaches require extensive manual feature extraction and predict only categorical outcomes, not the specific conditions attached to recommendations. Methods: We developed Monte Carlo Committee Simulation, a neurosymbolic system that simulates multi-panelist deliberation using 14 persona-conditioned large language model panelists with weighted voting and uncertainty quantification. We conducted a temporal external validation study on CDA-AMC (Canada's Drug Agency) sponsor-submitted recommendations published between October 2024 and December 2025 (n=67), after the knowledge cutoff of the underlying models, ensuring predictions reflected reasoning rather than memorization. The system predicted both recommendation category (Reimburse with Conditions, Do Not Reimburse) and five condition categories (Population Restrictions, Prescriber/Setting Requirements, Continuation Conditions, Economic Conditions, Evidence Conditions). Results: On submissions where the system expressed confidence (n=44), recommendation prediction achieved 93.2% accuracy (95% CI: 84.1-100.0%), exceeding the 91.8% (95% CI: 83.7-98.0%) majority-class baseline. The system demonstrated superior discrimination versus chance level (AUROC 0.817, 95% CI: 0.45-0.99, vs 0.500) and calibrated confidence estimates (ECE = 0.091). Pre-specified Strength of Mandate stratified accuracy from 96.8% (High, 95% CI: 90.3-100.0%) to 40.0% (Weak, 95% CI: 0.0-80.0%), with 83.3% of errors occurring in cases flagged as uncertain (p=0.0025). Analysis of the 5 abstained cases confirmed 40.0% accuracy, validating the system's identification of uncertain predictions.
For condition prediction, the system achieved 48.8% subset accuracy, requiring correct simultaneous prediction of all 5 condition categories (2^5 = 32 possible combinations), and 86.3% Hamming accuracy versus 25.8% for a no-conditions baseline. Per-category accuracy ranged from 68.3% (Continuation Conditions) to 97.6% (Economic Conditions), with Continuation Conditions demonstrating the strongest discriminative ability (AUROC 0.896, 95% CI: 0.79-0.98). Conclusions: Monte Carlo Committee Simulation enables a shift from reactive to proactive market access: anticipating specific reimbursement conditions before committee review, with calibrated confidence that identifies which predictions to trust. Validated on temporally separated data the models could not have memorized, the system can be positioned as a forecasting aid that complements rather than replaces human deliberation.
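The distinction between subset accuracy (all five condition categories correct simultaneously) and Hamming accuracy (per-label average) can be shown with a small numpy example on made-up labels:

```python
import numpy as np

# Illustrative predictions for 4 submissions x 5 binary condition
# categories (1 = condition attached); not the study's data.
y_true = np.array([[1, 0, 1, 1, 0],
                   [1, 1, 0, 1, 1],
                   [0, 0, 0, 1, 0],
                   [1, 0, 1, 1, 1]])
y_pred = np.array([[1, 0, 1, 1, 0],   # all 5 correct -> counts toward subset
                   [1, 1, 0, 0, 1],   # 4/5 correct
                   [0, 0, 0, 1, 0],   # all 5 correct
                   [1, 1, 1, 1, 1]])  # 4/5 correct

subset_acc = (y_true == y_pred).all(axis=1).mean()  # whole row must match
hamming_acc = (y_true == y_pred).mean()             # per-label average
print(subset_acc, hamming_acc)
```

Hamming accuracy is necessarily at least as large as subset accuracy (here 0.9 vs 0.5), which is why the abstract reports both: the stricter metric reflects predicting the full condition profile at once.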
Gibson, A. D.; White, N. M.; Collins, G. S.; Barnett, A.
Clinical prediction models are often created using large, routinely collected datasets. It is essential that prediction models are developed with appropriate data and methods and are transparently reported, to ensure that decisions are based on reliable predictions. Kaggle is a popular competition website where users learn and apply analysis skills on a range of datasets. We identified two large, publicly available Kaggle datasets, on stroke and diabetes, that lack clear data provenance but are widely used for clinical prediction models in peer-reviewed publications. The authenticity of both datasets could not be verified, and both show evidence of likely being simulated or fabricated. Data provenance assessment using nine TRIPOD+AI items revealed major deficiencies, with minimal detail for either dataset, including no information on when, where, why, or how the data were collected. From these two datasets, we found 124 clinical prediction model studies. Three prediction models had evidence of use in clinical practice, one model was cited in a medical device patent, and the models were cited in 86 review articles. We recommend that journals and data repositories mandate data provenance reporting to safeguard published research. Prediction models based solely on simulated or fabricated datasets should never be used to directly inform decisions on patient care.
Demdiont, A. C.
Algorithmic decision systems mediate access to healthcare, credit, employment, and housing, yet individuals who experience adverse decisions face multi-stage barriers when seeking recourse. We formalize these barriers as a series-structured system with 11 empirically parameterized stages across three layers (data integration, data accuracy, and institutional access) and prove that single-barrier interventions are bounded by baseline system success. Under a baseline parameterization derived from federal datasets and peer-reviewed algorithmic audit studies, end-to-end recourse probability is 0.0018%. Removing any single barrier yields negligible improvement (<0.02%). Factorial decomposition reveals that the three-way cross-layer interaction accounts for 87.6% of achievable improvement, confirmed by Shapley attribution, Sobol sensitivity analysis, and bootstrap resampling (n = 1,000). These results provide a structural explanation for the limited impact of incremental reforms and support coordinated multi-layer intervention approaches for clinical AI governance and algorithmic fairness.
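The series-structure argument can be illustrated with a toy calculation: the end-to-end recourse probability is the product of stage success probabilities, so eliminating a single barrier can at best divide through by that stage's probability (stage values below are hypothetical, not the paper's parameterization):

```python
# Hypothetical stage success probabilities for a series-structured recourse
# pipeline; the paper uses 11 empirically parameterized stages.
stages = [0.60, 0.50, 0.40, 0.30, 0.50, 0.40, 0.30, 0.25, 0.20, 0.35, 0.30]

def end_to_end(probs):
    """In a series system, overall success is the product of stage successes."""
    p = 1.0
    for s in probs:
        p *= s
    return p

base = end_to_end(stages)
# Removing one barrier (setting its probability to 1.0) yields at most
# base / p_i, so the gain is capped by the remaining stages' product.
best_single = max(end_to_end(stages[:i] + [1.0] + stages[i + 1:])
                  for i in range(len(stages)))
print(f"baseline: {base:.6%}, best single-barrier fix: {best_single:.6%}")
```

Even the best single-barrier fix here leaves success below 0.006%, which mirrors the paper's structural point that incremental single-stage reforms cannot move an end-to-end probability dominated by many multiplied barriers.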
King, J. B.; Derington, C. G.; Xu, S.; Clark, N. P.; Reynolds, K.; An, J.; Witt, D. M.; O'Keeffe Rosetti, M.; Lang, D. T.; Ho, P. M.; Ozanne, E. M.; Bellows, B. K.
Background: Pharmacist-led anticoagulation management services (AMS) for direct oral anticoagulants (DOACs) reduce prescribing errors and enhance adherence, but have not demonstrated lower rates of stroke or bleeding compared with usual care, and their cost-effectiveness is unknown. We evaluated four anticoagulant strategies for patients with atrial fibrillation initiating therapy: warfarin AMS, DOAC usual care, DOAC population management tool (PMT), and DOAC AMS. Methods: We developed a Markov model with monthly cycles simulating lifetime risk of ischemic stroke, major bleeding, death, disability, and costs from a US healthcare sector perspective. Costs and outcomes were discounted 3% annually. Model probabilities were derived from a prior Kaiser Permanente comparative-effectiveness analysis; other inputs were drawn from published literature and national data. Primary outcomes were direct healthcare costs (2025 USD), quality-adjusted life years (QALYs), and incremental cost-effectiveness ratios (ICERs). Sensitivity analyses assessed parameter uncertainty. Results: DOAC-based strategies yielded greater QALYs than warfarin AMS and were cost-effective at standard willingness-to-pay thresholds. Compared with warfarin AMS, DOAC usual care gained 0.4 QALYs (ICER $89,200/QALY), DOAC PMT gained 0.6 QALYs (ICER $66,700/QALY), and DOAC AMS gained 0.6 QALYs (ICER $64,500/QALY). DOAC usual care and DOAC PMT were extendedly dominated by DOAC AMS. At $120,000/QALY, DOAC AMS was preferred in 50.4% of probabilistic iterations, DOAC PMT in 36.3%, DOAC usual care in 11.0%, and warfarin AMS in 2.3%. Results were most sensitive to DOAC program effectiveness and DOAC costs. Conclusions: Pharmacist-led DOAC management is cost-effective compared with warfarin AMS for patients with AF. These findings support broader adoption of structured DOAC management programs to optimize anticoagulation therapy.
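The ICER and extended-dominance logic in the abstract can be reproduced with a small sketch; the absolute costs and QALY totals below are invented, but the increments are chosen to match the reported $89,200/QALY and $64,500/QALY ratios:

```python
# Illustrative (cost USD, QALYs) pairs per strategy; only the increments
# relative to warfarin AMS are meaningful here.
strategies = {
    "warfarin AMS":    (100_000, 10.0),
    "DOAC usual care": (135_680, 10.4),
    "DOAC AMS":        (138_700, 10.6),
}

def icer(ref, alt):
    """Incremental cost-effectiveness ratio of alt vs ref: dCost / dQALY."""
    (c0, q0), (c1, q1) = ref, alt
    return (c1 - c0) / (q1 - q0)

i_uc = icer(strategies["warfarin AMS"], strategies["DOAC usual care"])
i_ams = icer(strategies["warfarin AMS"], strategies["DOAC AMS"])
# Extended dominance: usual care's ICER exceeds that of the more effective
# DOAC AMS strategy, so usual care drops off the efficiency frontier.
print(i_uc, i_ams, i_uc > i_ams)
```

This is why the abstract can report DOAC usual care as "extendedly dominated": a decision-maker willing to pay its ICER would always do better buying the cheaper-per-QALY, more effective DOAC AMS instead.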
Mehran, R. J.; Kuriyan, J.
Importance: Prevention-focused health policy requires analytic frameworks capable of detecting changes in population health and associated costs within policy-relevant time horizons, particularly in managed care systems where premiums reflect actuarial risk rather than realized medical expenditures. Objective: To evaluate a healthstate-based analytic framework (CareMaps) for measuring population health dynamics, disease progression, and associated costs using longitudinal Medicaid managed care claims data. Design, Setting, and Participants: Retrospective longitudinal analysis of deidentified Medicaid managed care claims in New Mexico from 2011 through 2014. The study included individuals aged 0 to 64 years enrolled in managed care plans. Exposures: Chronic disease burden categorized into mutually exclusive, ordered healthstates based on the number of chronic conditions. Main Outcomes and Measures: County- and managed care organization (MCO)-level prevalence of healthstates, transition rates between healthstates, and healthstate-specific cost estimates derived from capitation premiums and medical-loss-ratio-defined medical expenditures. Results: The CareMaps framework identified specific geographic and MCO-level variation in chronic disease prevalence, healthstate transition rates, and per-member spending patterns that were not fully explained by actuarial risk adjustment. Transitions from nonchronic to chronic healthstates varied markedly across counties, indicating heterogeneity in disease progression and prevention-related outcomes. Conclusions and Relevance: A healthstate-based analytic framework applied to longitudinal Medicaid managed care data enables standardized measurement of population health dynamics and associated costs within policy-relevant time horizons. Such approaches may support evaluation of preventive care performance, inform risk adjustment, and enhance public-sector oversight of managed care programs.
Mhatre, P.; von Rosenvinge, L.; Suresh, A.; Patzkowsky, K.; Frost, A.; Vargas, M. V.; Wu, H.; Wang, K.; Simpson, K.; Segars, J.; Singh, B.
Background: Uterine fibroids cause significant morbidity, psychosocial stress, and poor quality of life due to symptoms including heavy menstrual bleeding, anemia, pain, and bulk symptoms, as well as reproductive complications including infertility, early pregnancy loss, and preterm birth. Fibroids represent a 42.2 billion USD annual economic burden to the United States healthcare system. Despite reported delays in the diagnosis of fibroids even in symptomatic women, clinical guidelines do not recommend screening for fibroids, although high-risk patient groups are well known. Earlier detection of fibroids through ultrasound screening could allow earlier intervention with secondary prevention strategies or less invasive treatment options and improve the quality of life of women living with fibroids. Objective: The study aimed to evaluate the cost-effectiveness of annual ultrasound screening for fibroids in women aged 25-54 years in the United States. Study Design: In this economic evaluation, conducted in January-February 2026, a decision-analytic Markov model was developed from a healthcare payer perspective to analyze the cost-effectiveness of ultrasound screening for women in the United States. The time horizon was 25 to 55 years of age. Costs were adjusted for inflation to 2025 values using the medical care index of the United States consumer price index. Discounting (3% per cycle) and half-cycle corrections were applied. Deterministic and probabilistic sensitivity analyses were performed to explore uncertainty, using TreeAge Pro Healthcare software. Model variables were obtained from published literature. All women residing in the United States aged 25-54 years were assumed to have been invited to the screening program.
Results: Ultrasound screening for fibroids was found to be not only cost-effective but also cost-saving, with an incremental cost-effectiveness ratio (ICER) of -$56,605.631 per QALY (quality-adjusted life-year) gained in the base-case analysis, at a willingness-to-pay threshold of $30,000 per QALY. Ultrasound screening was cost-effective at all starting ages from 25 to 54 years, with even greater benefit at younger ages. Sensitivity analyses demonstrated the robustness of these findings across a wide range of variable values. Ultrasound screening for fibroids showed a cumulative potential to save $1,169 billion and gain 20.7 million QALYs per year compared with no screening for a population of 63.89 million American women between 25 and 54 years old. The subset of 9.32 million Black American women experienced greater benefits, with potential savings of $183 billion and a gain of 3 million QALYs. Conclusion: Based on the model-based analysis, annual ultrasound screening for uterine fibroids in women aged 25-54 years in the United States was cost-effective and cost-saving, even more so for Black women. These findings highlight the potential value of guidelines for annual ultrasound screening for fibroids, which could enable earlier diagnosis, secondary prevention, and timely intervention, with positive impact on both quality of life and healthcare costs. Tweetable Statement: Annual ultrasound screening for uterine fibroids in U.S. women aged 25-54 years was cost-effective and cost-saving.
Study at a Glance
A. Why was this study conducted?
- To evaluate whether annual ultrasound screening for fibroids in women aged 25-54 years in the U.S. is cost-effective.
B. What are the key findings?
- Annual ultrasound screening beginning at 25 years was both cost-effective and cost-saving, with an ICER of -$56,605.631/QALY for women in the US.
- Screening resulted in potential savings of $1,169 billion for US healthcare payers and 20.7 million QALYs for U.S. women.
C. What does this study add to what is already known?
- Annual ultrasound screening for fibroids is not only cost-effective but also cost-saving, highlighting its potential to reduce diagnostic delays and enable earlier, less invasive interventions.
- The results support the development and implementation of fibroid screening guidelines.
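The discounting and half-cycle correction mentioned in the Study Design can be sketched as follows; this uses the common shortcut of shifting the discount exponent by half a cycle, which is one of several ways the correction is implemented in practice:

```python
import numpy as np

def discounted_total(values, annual_rate=0.03, half_cycle=True):
    """Sum per-cycle values (annual cycles) with discounting; the half-cycle
    option approximates mid-cycle transitions by shifting the discount
    exponent by 0.5, a common shortcut for the half-cycle correction."""
    t = np.arange(len(values), dtype=float)
    if half_cycle:
        t += 0.5                      # value accrues mid-cycle, not at start
    return float(np.sum(np.asarray(values) / (1 + annual_rate) ** t))

# Illustrative: 30 annual cycles at 0.9 QALY/year, 3% discount rate.
qalys = discounted_total([0.9] * 30)
print(qalys)   # discounted total is well below the undiscounted 27 QALYs
```

The point of the correction is that state membership is counted at cycle boundaries while events occur throughout the cycle; the mid-cycle shift (or, equivalently, averaging adjacent cycle memberships) reduces the resulting bias.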
Liu, C.; Mayer, M.; Lactaoen, K.; Gomez, L.; Weissman, G.; Hubbard, R.
Hybrid controlled trials (HCTs) incorporate real-world data into randomized controlled trials (RCTs) by augmenting the internal control arm with patients receiving the same treatment in routine care. Beyond increasing power, HCTs may improve recruitment by supporting unequal randomization ratios that increase patient access to experimental treatments. However, HCT validity is threatened by bias from unmeasured confounding due to lack of randomization of external controls, leading to outcome non-exchangeability between internal and external control patients. To address this challenge, we developed a sensitivity analysis framework to assess the robustness of HCT results to potential unmeasured confounding. We propose a tipping point analysis that adapts the E-value framework to the HCT setting where trial participation rather than treatment assignment is subject to confounding. To aid interpretation, we also introduce a data-driven benchmark representing the strength of unmeasured confounding reflected by the observed outcome non-exchangeability. We then propose an operational decision rule and evaluate its performance through simulation studies. Finally, we illustrate the approach using an asthma trial augmented by data from electronic health records. Simulation results demonstrate that our decision rule safeguards against Type I error inflation while preserving the power gains achieved by incorporating external data. In settings where moderate unmeasured confounding led to poorer outcomes for external controls, Type I error was controlled near the nominal 5% level, and power increased by 10-20% compared with analyses using RCT data alone. Our approach provides a practical, interpretable method to assess HCT robustness, supporting rigorous inference when integrating external real-world data.
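The E-value that the tipping-point analysis adapts has a closed form (VanderWeele and Ding): for an observed risk ratio RR >= 1, E = RR + sqrt(RR * (RR - 1)). A minimal sketch of the standard formula follows; the paper's specific adaptation to trial-participation (rather than treatment-assignment) confounding is not reproduced here:

```python
import math

def e_value(rr):
    """E-value for a risk ratio: the minimum strength of association, on the
    risk-ratio scale, that an unmeasured confounder would need with both the
    exposure (here, trial participation) and the outcome to fully explain
    away an observed RR. Protective RRs are inverted first by symmetry."""
    rr = max(rr, 1.0 / rr)
    return rr + math.sqrt(rr * (rr - 1.0))

print(e_value(1.5), e_value(0.8))
```

An observed RR of 1.5 yields an E-value of about 2.37, meaning a confounder associated with both participation and outcome by risk ratios of 2.37 could tip the result to the null; smaller E-values mean more fragile findings.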
Schwoebel, J.; Frasch, M.; Spalding, A.; Sewell, E.; Englert, P.; Halpert, B.; Overbay, C.; Semenec, I.; Shor, J.
As health systems begin deploying autonomous AI agents that make independent clinical decisions and take direct actions within care workflows, ensuring patient safety and care quality requires governance standards that go beyond existing medical device frameworks designed for human-in-the-loop prediction tools. This paper introduces the Healthcare AI Agents Regulatory Framework (HAARF), a comprehensive verification standard for autonomous AI systems in clinical environments, developed collaboratively with 40+ international experts spanning regulatory authorities, clinical organizations, and AI security specialists. HAARF synthesizes requirements from nine major regulatory frameworks (FDA, EU AI Act, Health Canada, UK MHRA, NIST AI RMF, WHO GI-AI4H, ISO/IEC 42001, OWASP AISVS, IMDRF GMLP) into eight core verification categories comprising 279 specific requirements across three risk-based implementation levels. The framework addresses critical gaps in health system readiness for autonomous AI, including: (1) progressive autonomy governance with clinical accountability, (2) tool-use security for agents that independently access EHRs, medical devices, and clinical systems, (3) continuous equity monitoring and bias mitigation across diverse patient populations, and (4) clinical decision traceability preserving human oversight authority. We validate HAARF's enforcement capabilities through a scenario-based red-team evaluation comprising six adversarial scenarios executed under baseline (no middleware) and HAARF-guardrailed conditions (N = 50 trials each, Gemini 2.5 Flash primary with Claude Sonnet 4.6 cross-model validation). In baseline conditions, the agent model executes unauthorized tools in 56-60% of adversarial trials. Under the HAARF condition, deterministic middleware enforcement reduces the unauthorized-tool success rate to 0%, with 0% contraindication misses and 0% policy-injection success (95% Wilson CI [0.00, 0.07]).
Cross-model validation confirms identical security metrics, supporting HAARF's model-agnostic design. Mapping analysis demonstrates 48-88% coverage of major regulatory frameworks, with per-category FDA alignment ranging from 73% (C5, Agent Registration) to 91% (C3, Cybersecurity; C7, Bias & Equity). Initial validation with healthcare organizations shows a 40-60% reduction in multi-jurisdictional compliance burden and improved clinical safety governance outcomes. HAARF provides health systems with a practical, risk-stratified pathway for safe AI agent deployment, shifting from reactive compliance to proactive quality governance while maintaining rigorous patient safety standards and human-centered care principles.
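The Wilson interval quoted for 0 policy-injection successes in 50 trials can be checked directly; the score interval remains informative at zero counts, where the naive Wald interval collapses to [0, 0]:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion; well-behaved at
    0 or n successes, unlike the Wald interval."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

lo, hi = wilson_ci(0, 50)  # 0 unauthorized successes in 50 trials
print(lo, hi)              # reproduces the reported 95% CI of [0.00, 0.07]
```

Observing zero failures in 50 trials is therefore consistent with a true failure rate as high as roughly 7%, which is why the abstract reports the interval rather than the bare 0%.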
Maitreyi, L.; Rajagopal, S.; Anandkumar, A.; Datta, S.
India faces a mounting health crisis from antibiotic resistance, coupled with global pharmaceutical hesitancy to invest in novel antibiotic research and development (R&D), driven by complex scientific and financial hurdles. India carries one of the world's largest absolute burdens of drug-resistant infections. The combination of a huge infectious-disease caseload, rapid urbanisation, and gaps in sanitation and primary care means that, when resistance emerges, it affects far more patients and generates a much larger pool of patients needing advanced antibiotics than in many high-income countries. Against this backdrop, demand for truly novel, broad-spectrum antibiotics in India is surging, fueled by rising multidrug-resistant infections, overstretched hospitals, and an antibiotic resistance market projected to grow rapidly over the next decade. Most countries respond with incentives and subscription models; for India, the answer lies in bold, innovative revenue strategies and in prioritising the domestic launch of novel antibiotics. This paper presents an econometric analysis of the estimated valuation of a novel broad-spectrum antibiotic in India that, as a single therapeutic agent, can address several major hospital-acquired infections, including complicated urinary tract infections (cUTI), hospital-acquired pneumonia (HAP), and ventilator-associated pneumonia (VAP). The model focuses on a hypothetical "ideal" broad-spectrum intravenous antibiotic and recommends that India pioneer market entry, highlighting financial models that maximise early revenues while still hardwiring stewardship. Launching new antibiotics first in India can catalyse robust real-world use, strengthen domestic pharma, and demonstrate that the economics of antibiotic innovation are viable.
This decisive shift can transform India from a passive recipient of ageing drugs into the crucible where the next generation of life-saving antibiotics is forged, anchoring antibiotic research at the core of the country's health security and economic resilience.
Bowen, H. P.; O'Loughlin, G.; Drake, C.; Schleicher, C.; Schulthess, D.
Background: The Most Favored Nation (MFN) policy is a mechanism that incorporates foreign prices to determine the maximum allowable net price for any branded drug within US government-funded healthcare. Two proposed rules, the Global Benchmark for Efficient Drug Pricing ("GLOBE") (90 Fed. Reg. 60,244) for Medicare Part B and the Guarding US Medicare Against Rising Drug Costs ("GUARD") (90 Fed. Reg. 60,338) for Medicare Part D, invoke the Center for Medicare and Medicaid Innovation's payment and service model demonstration and waiver authority, under Section 1115A of the Social Security Act (42 U.S.C. § 1315a), to calculate the US MFN price, which is the lowest average price within a basket of specified foreign countries. Unlike voluntary manufacturer agreements, GLOBE and GUARD would mandate participation from all applicable manufacturers. Methods: We derive MFN's potential impact on Medicare pricing from a proprietary IQVIA dataset containing net prices for the top 37 oncology products by total US sales, from January 1, 2019 through June 30, 2025, in the following countries: Australia, Belgium, France, Germany, Ireland, Italy, South Africa, Spain, Switzerland, the UK, and the US. For each drug, we select the lowest GDP-adjusted international price from the subset of those countries within 60% of the US GDP per capita, adjusted for purchasing power parity, and calculate the reduction in US price required to match this MFN price, and hence the corresponding reduction in revenues under MFN. A retrospective Net Present Value (NPV) analysis is then used to address the counterfactual question of whether each drug would have been developed had MFN pricing been in place at the time of its FDA approval. Results: Under MFN, the average reduction in US prices across our drug cohort was 67%.
Eighty-four percent of the 37 cancer drugs in our cohort would have had a negative NPV if MFN had been in place at the time of their FDA approval and had also affected the commercial market. When the analysis is restricted to MFN's impact on Medicare, the indications for these lost drugs cover a total US population of 2.4 million patients. When the analysis is combined across the Medicare and commercial markets, the loss of lead indications affects over 15 million US patients. Conclusions: Mandatory MFN policies reduce the financial incentives required to develop cancer medicines; our projections show a substantial decline in new cancer drug launches and suggest companies will likely pursue indications for populations outside Medicare's authority. If so, MFN will reduce the number of new therapies for the very population the Executive Orders are ostensibly designed to aid: the Medicare-aged population who require effective new therapies in areas of high unmet medical need, such as late-stage cancers. A policy nominally designed to help Medicare beneficiaries would thus produce the perverse outcome of redirecting innovation away from their most urgent therapeutic needs.
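The reference-price selection described in the Methods can be sketched as follows. All figures below are illustrative stand-ins, not values from the paper's proprietary IQVIA dataset, and the 60% GDP screen is interpreted here as "GDP per capita at least 60% of the US level", which is one plausible reading of the rule:

```python
# Hypothetical sketch of MFN reference-price selection.
# Country names are real; all prices and GDP figures are illustrative.

def mfn_price(us_price, foreign_prices, gdp_ppp, us_gdp_ppp, threshold=0.60):
    """Lowest foreign net price among countries whose PPP-adjusted GDP per
    capita is at least `threshold` of the US level, plus the implied
    reduction in the US net price needed to match it."""
    eligible = {c: p for c, p in foreign_prices.items()
                if gdp_ppp[c] >= threshold * us_gdp_ppp}
    ref = min(eligible.values())
    reduction = 1 - ref / us_price
    return ref, reduction

us_gdp = 80_000                                         # assumed US GDP/capita (PPP)
gdp = {"Germany": 66_000, "France": 58_000, "South Africa": 16_000}
prices = {"Germany": 40.0, "France": 35.0, "South Africa": 12.0}

ref, cut = mfn_price(100.0, prices, gdp, us_gdp)
# South Africa falls below the GDP screen, so France sets the MFN price (35.0),
# implying a 65% cut from the assumed US price of 100.0.
```

The GDP screen matters: without it, the lowest-income country in the basket would mechanically set the reference price.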
Jamieson, L.; Venter, W. D. F.; Meyer-Rath, G.
Introduction: Dolutegravir-based first-line antiretroviral therapy (tenofovir disoproxil fumarate, lamivudine, and dolutegravir; TLD) has delivered substantial clinical and public health benefits. However, sharply decreasing funding for HIV programmes necessitates cost reduction within current treatment guidelines. We evaluated whether replacing tenofovir disoproxil fumarate with tenofovir alafenamide (TAFLD), a drug with equivalent effectiveness and side effect profile, could reduce HIV treatment costs in South Africa. Methods: We conducted a budget-impact analysis over 2026-2030 from the provider perspective. The cost of antiretroviral treatment (ART) provision with either TLD or TAFLD was estimated using ingredients-based costing, including the cost of drugs, laboratory monitoring, staff, consumables, equipment, and overheads. Costs are reported in 2025 USD and are neither discounted nor inflated. Population estimates for adults on first-line therapy were derived from Thembisa 4.8. We modelled a phased transition from TLD to TAFLD over two years, and explored sensitivity to TAFLD price variation (±15%) and inclusion of creatinine monitoring. Results: TAFLD reduced per-patient annual costs by 4-5% compared with TLD (from US$178 to US$169, and from US$287 to US$277, for first and follow-up years, respectively). At full replacement, total programme savings were approximately US$54 million per year (-5%). Even with continued creatinine monitoring, TAFLD remained cost-saving, reducing annual costs by around 4%. Savings increased to 8% if TAFLD prices were 15% lower than base-case assumptions. Conclusions: Replacing TDF with TAF in first-line antiretroviral therapy could generate meaningful cost savings for South Africa with minimal programme disruption. While long-term metabolic effects require consideration, TAFLD represents a feasible interim cost-reduction strategy while awaiting next-generation HIV therapies.
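The headline programme saving follows directly from the per-patient cost difference. A minimal sketch, assuming an illustrative population of 5.4 million adults on first-line ART (an assumption for this sketch, not a Thembisa 4.8 output) and the follow-up-year unit costs quoted above:

```python
# Illustrative budget-impact sketch of the TLD-to-TAFLD switch.
# Unit costs echo the follow-up-year figures quoted in the abstract
# (US$287 TLD vs US$277 TAFLD); the population size is assumed.

def programme_cost(n_patients, share_tafld, cost_tld, cost_tafld):
    """Total annual ART cost given the fraction switched to TAFLD."""
    return n_patients * ((1 - share_tafld) * cost_tld + share_tafld * cost_tafld)

n = 5_400_000                                  # assumed adults on first-line ART
baseline = programme_cost(n, 0.0, 287, 277)    # all patients on TLD
full_switch = programme_cost(n, 1.0, 287, 277) # all patients on TAFLD
savings = baseline - full_switch               # n * (287 - 277) = US$54 million
```

A phased transition is the same calculation with `share_tafld` rising over the two-year switch period.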
Liu, Z.; Liang, Y.; Wang, L. S.; Yu, J.; Liu, J.
Closed-form minimum sample size criteria for developing logistic prediction models, such as the Riley framework implemented in pmsampsize, are widely used but may become optimistic when anticipated discrimination is high. We conducted a Monte Carlo simulation study to compare the formula-based recommended development sample size, n_Riley, with an empirical required sample size, n_req, defined by out-of-sample calibration-slope stability under repeated development sampling. Scenarios fixed the candidate parameter dimension at p = 10 and crossed predictor distribution (normal, standardized skewed continuous, binary), signal density (dense versus sparse), prevalence (φ ∈ {0.05, 0.10, 0.20}), and target discrimination (AUC_target ∈ {0.70, 0.75, 0.80, 0.85, 0.90}), with intercept and signal strength calibrated to match targets. We defined n_req as the smallest n such that E(b_n) ≥ 0.90 and Pr(b_n < 0.80) ≤ 0.20, where b_n is the truth-based logit-scale calibration slope evaluated on a large fixed validation covariate set. At moderate discrimination, n_Riley approximated n_req, but as discrimination increased the formula increasingly underestimated the sample size required for calibration stability, with large deficits at AUC_target = 0.90. Separation-like behavior (extreme fitted risks and linear predictors) at n = n_Riley became common in high-discrimination settings despite nominal convergence, providing a plausible mechanism for formula optimism. These findings support augmenting formula-based planning with targeted simulation stress tests and instability diagnostics when high discrimination is anticipated.
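The empirical stopping rule that defines n_req can be written as a small check over simulated calibration slopes: a candidate n is adequate when E(b_n) ≥ 0.90 and Pr(b_n < 0.80) ≤ 0.20. This is a minimal sketch of the decision rule only, with illustrative slope values, not the authors' simulation code:

```python
# Decision rule for calibration-slope stability, as described above.
# `slopes` holds calibration slopes b_n from repeated development samples
# of size n, each evaluated on a fixed validation covariate set.

def meets_stability_criteria(slopes, mean_floor=0.90, tail_cut=0.80, tail_prob=0.20):
    """True iff mean(b_n) >= 0.90 and the fraction of slopes below 0.80
    is at most 0.20 (both thresholds from the paper's definition)."""
    mean_slope = sum(slopes) / len(slopes)
    tail_fraction = sum(b < tail_cut for b in slopes) / len(slopes)
    return mean_slope >= mean_floor and tail_fraction <= tail_prob

stable   = [0.95, 0.91, 0.88, 0.97, 0.93]   # mean 0.928, none below 0.80
unstable = [0.85, 0.70, 0.95, 0.75, 0.90]   # mean 0.83, 40% below 0.80
```

n_req is then the smallest n whose slope distribution passes this check; in practice one would use far more than five replicates per n.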
Fadikar, A.; Hotton, A.; de Lima, P. N.; Vardavas, R.; Collier, N.; Jia, K.; Rimer, S.; Khanna, A.; Schneider, J.; Ozik, J.
Detailed agent-based simulations are increasingly used to support policy decisions, but their computational cost and complex uncertainty structure make systematic scenario analysis challenging. We present a data-driven, uncertainty-aware decision support (DDUADS) workflow for using stochastic simulation models as decision-support tools under limited computational budgets. The approach combines several established techniques (sensitivity screening, Bayesian calibration using simulation-based inference, and multi-surrogate model integration for translational efficiency) into a coherent pipeline that enables uncertainty-aware policy analysis. Rather than producing a single baseline, the calibration stage yields a posterior distribution over plausible model parameterizations, allowing flexible, uncertainty-aware forward projections. We demonstrate the DDUADS workflow on the INFORM-HIV agent-based model of HIV transmission in Chicago to evaluate potential disruptions in antiretroviral therapy (ART) and pre-exposure prophylaxis (PrEP) use. While the specific application is HIV modeling, the challenges and techniques described here arise in other simulation studies and can be applied to decision support in other domains.
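The "posterior rather than single baseline" idea can be sketched generically: draw parameter sets from the calibrated posterior, run the forward model once per draw, and summarize the resulting band of trajectories. The toy model and posterior below are assumptions for illustration, not part of INFORM-HIV:

```python
# Uncertainty-aware forward projection from a calibrated posterior.
# The forward model is a trivial stand-in for an agent-based simulation.

import random

def forward_model(growth_rate, horizon=10, y0=100.0):
    """Toy stand-in: one trajectory under a single parameter draw."""
    return [y0 * (1 + growth_rate) ** t for t in range(horizon)]

random.seed(0)
# Assumed posterior over the growth-rate parameter (illustrative).
posterior_draws = [random.gauss(0.05, 0.01) for _ in range(200)]
trajectories = [forward_model(r) for r in posterior_draws]

# A projection band at the final time point, instead of one baseline run.
final = sorted(traj[-1] for traj in trajectories)
lo = final[int(0.025 * len(final))]
hi = final[int(0.975 * len(final))]
```

With an expensive simulator, the surrogate models mentioned in the abstract would stand in for `forward_model` when evaluating many posterior draws.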
Ben-Joseph, J.
Lightweight epidemic calculators are widely used for teaching and rapid scenario exploration, yet many omit the methodological detail needed for scientific reuse. We present a browser-native SIR calculator that exposes forward Euler and classical fourth-order Runge-Kutta (RK4) integration alongside epidemiologically interpretable outputs and a population-conservation diagnostic. The implementation is anchored to analytical properties of the deterministic SIR system, including the epidemic threshold, the peak condition, and the final-size relation. Benchmark experiments show that RK4 is essentially step-size invariant over practical discretizations, whereas Euler at a coarse one-day step overestimates peak prevalence by 3.97% and final size by 0.66% relative to a fine-step RK4 reference. These results demonstrate that browser-based tools can support publication-quality computational narratives when solver choice, diagnostics, and assumptions are treated as first-class outputs.
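The solver comparison can be reproduced in miniature with a fixed-step SIR integrator. The beta, gamma, and step sizes below are illustrative choices, not the calculator's defaults:

```python
# Forward Euler vs classical RK4 on the deterministic SIR system,
# tracking peak prevalence. S and I are fractions of a closed population.

def sir_rhs(s, i, beta, gamma):
    return -beta * s * i, beta * s * i - gamma * i

def step_euler(s, i, h, beta, gamma):
    ds, di = sir_rhs(s, i, beta, gamma)
    return s + h * ds, i + h * di

def step_rk4(s, i, h, beta, gamma):
    k1 = sir_rhs(s, i, beta, gamma)
    k2 = sir_rhs(s + h/2 * k1[0], i + h/2 * k1[1], beta, gamma)
    k3 = sir_rhs(s + h/2 * k2[0], i + h/2 * k2[1], beta, gamma)
    k4 = sir_rhs(s + h * k3[0], i + h * k3[1], beta, gamma)
    ds = (k1[0] + 2*k2[0] + 2*k3[0] + k4[0]) / 6
    di = (k1[1] + 2*k2[1] + 2*k3[1] + k4[1]) / 6
    return s + h * ds, i + h * di

def peak_prevalence(stepper, h, beta=0.3, gamma=0.1, days=200):
    s, i, peak = 0.999, 0.001, 0.0
    for _ in range(int(days / h)):
        s, i = stepper(s, i, h, beta, gamma)
        peak = max(peak, i)
    return peak

coarse_euler = peak_prevalence(step_euler, h=1.0)   # one-day Euler
fine_rk4 = peak_prevalence(step_rk4, h=0.05)        # fine-step RK4 reference
```

For these parameters (R0 = 3), the analytical peak condition gives a peak prevalence near 0.30, which the fine-step RK4 run recovers closely; the coarse Euler run deviates by a few percent, illustrating the step-size sensitivity reported above.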
Breeze, P. R.; Pidd, K.; Kalbus, A.; Cornelsen, L.; Brown, K. A.; Cummins, S.; Marks, D.; Law, C.; Smith, R.; Tanasache, O.; Er, V.; Forbes, C.; Brennan, A.
Objective: In England, since 2022, large businesses providing food in the out-of-home sector have been required to display calorie information for non-prepacked food and non-alcoholic drink items. This study estimates the long-term cost-effectiveness of the policy by extrapolating real-world evidence on short-term policy effects in England. Design: The lifetime health economic impacts of calorie labelling were simulated using a microsimulation model. The analysis adopted a health systems perspective to compare the policy with a counterfactual no-intervention scenario. The policy may affect calories consumed through consumer behaviour changes and through menu changes, based on observations from real-world evaluations. Estimated changes to daily calorie intake are translated into weight changes. Simulated outcomes include changes in obesity, diabetes cases, cardiovascular events, quality-adjusted life years (QALYs), and National Health Service costs, with probabilistic sensitivity analysis to describe uncertainty. Setting: A synthetic population for England aged 13-79 was generated combining data from the National Diet and Nutrition Survey (2009-19) and the Health Survey for England (2018, 2019). Participants: None. Results: The policy was estimated to generate lifetime incremental costs of -£9.15 per person (95% CI -£31.63, £2.50), i.e., a net cost saving, and incremental QALYs of 0.0021 (95% CI -0.0008, 0.0048) per person. The incremental net benefit at £20,000 per QALY was £50.23 (95% CI -£16.41, £74.68). Greater cost savings and QALY gains were observed in the most deprived groups. Discussion: The out-of-home calorie labelling policy in England is most likely cost-effective, albeit with uncertainty, generating cost savings and marginal health benefits. The results are driven by expected menu changes.
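The headline incremental net benefit is a direct function of the per-person incremental costs and QALYs. A minimal sketch of that calculation (the small gap to the reported mean of £50.23 arises because that figure is averaged over probabilistic draws, whereas this uses the point estimates):

```python
# Incremental net (monetary) benefit at a given willingness-to-pay:
# INB = wtp * delta_QALY - delta_cost. Point estimates are the per-person
# means quoted in the abstract; negative delta_cost means a cost saving.

def incremental_net_benefit(d_qaly, d_cost, wtp=20_000):
    """Net monetary benefit per person at `wtp` GBP per QALY."""
    return wtp * d_qaly - d_cost

inb = incremental_net_benefit(d_qaly=0.0021, d_cost=-9.15)
# 20_000 * 0.0021 - (-9.15) = 51.15 per person
```

A positive INB at the chosen threshold is the criterion behind the "most likely cost-effective" conclusion; in the probabilistic analysis it is evaluated draw by draw.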